Computer Architecture - Tutorial 1

# E1: H&P (2/e) 1.6 p.61 – individual or groups of 2, 15 mins.

*Problem*

After graduating, you are asked to become the lead computer designer at Hyper Computers, Inc. Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

* The clock rate of the unoptimized version is 5% higher.
* 30% of the instructions in the unoptimized version are loads or stores.
* The optimized version executes 2/3 as many loads and stores as the unoptimized version. For all other instructions the dynamic counts are unchanged.
* All instructions (including load and store) take one clock cycle. Which is faster? Justify your decision quantitatively.

*Solution*

**Given that**:

* Unoptimized clock rate = 1.05 × optimized clock rate (5% higher).
* 30% of unoptimized instructions are loads/stores.
* Optimized version executes 2/3 as many loads/stores.
* All instructions take 1 clock cycle.

CPU performance equation: *CPU Time* = *IC* ∗ *CPI* ∗ *ClockTime*

We know:

*ClockTimeunop* = 0*.*95 ∗ *ClockTimeop*

*ICld/st,unop* = 0*.*3 ∗ *ICunop ICld/st,op* = 0*.*67 ∗ *ICld/st,unop ICothers,unop* = *ICothers,op CPI* = 1

Thus:

*CPUTimeunop* = *ICunop* ∗ *ClockTimeunop* = 0*.*95*ICunop* ∗ *ClockTimeop*

*CPUTimeop* = *ICop* ∗ *ClockTimeop*

but,

*ICop* = 0*.*7 ∗ *ICunop* + 0*.*3 ∗ 0*.*67 ∗ *ICunop* = 0*.*97 ∗ *ICunop*

then,

*CPUTimeop* = 0*.*9 ∗ *ICunop* ∗ *ClockTimeop*

*=* 0*.*95 ∗ *ICunop* ∗ *ClockTimeop*

0*.*9 ∗ *ICunop* ∗ *ClockTimeop*

= 1*.*058

# Comparing 1 and 2 we see that the optimized version is 5.8% faster.

# E2: H&P (2/e) 2.6 p.164 – individual or groups of 2, 10 mins.

*Problem*

Several researchers have suggested that adding a register-memory addressing mode to a load-store machine might be useful. The idea is to replace sequences of:

LOAD Rx,0(Rb)

ADD Ry,Ry,Rx

by

ADD Ry,0(Rb)

Assume this new instruction will cause the clock period of the CPU to increase by 5%. Use the instruction frequencies for the gcc benchmark on the load-store machine from Table 1. The new instruction affects only the clock cycle and not the CPI.

1. What percentage of the loads must be eliminated for the machine with the new in- struction to have at least the same performance?
2. Show a situation in a multiple instruction sequence where a load of a register (Rx) followed immediately by a use of the same register (Rx) in an ADD instruction, could not be replaced by a single ADD instruction of the form proposed.

|  |  |
| --- | --- |
| **Instruction** | **Frequency** |
| load | 22.8% |
| store | 14.3% |
| add | 14.6% |
| sub | 0.5% |
| mul | 0.1% |
| div | 0% |
| compare | 12.4% |
| load imm | 6.8% |
| cond. branch | 11.5% |
| uncond. branch | 1.3% |
| call | 1.1% |
| return, jump ind. | 1.5% |
| shift | 6.2% |
| and | 1.6% |
| or | 4.2% |
| other (xor, not) | 0.5% |

Table 1: Instruction frequencies for gcc (cc1) (from H&P (2/e), Figure 2.26, p.105)

***Solution***

Where; Clock period increases by 5%.

CPU performance equation: *CPUTime* = *IC* ∗ *CPI* ∗ *ClockTime*

We know:

*ClockTimeop* = 1*.*05 ∗ *ClockTimeunop*

*CPIop* = *CPIunop*

## 1.

but,

*CPUTimeop* = *CPUTimeunop*

⇒ *ICop* ∗ 1*.*05 ∗ *ClockTimeunop* = *ICunop* ∗ *ClockTimeunop*

⇒ *ICop* = 0*.*95 ∗ *ICunop*

*ICop* = *ICunop* − *R*

where *R* is the number of instructions removed, then combining 3 and 4,

*ICunop* − *R* = 0*.*95 ∗ *ICunop*

⇒ *R* = 0*.*05 ∗ *ICunop*

but (from figure 2.32, pg. 138),

*ICld,unop* = 0*.*228 ∗ *ICunop*

combining 5 and 6,

*R* = 0*.*05 ∗ 4*.*39 ∗ *ICld,unop*

⇒ *R* = 0*.*219 ∗ *ICld,unop* = 21.9%

21.9% loads must be eliminated to offset the 5% clock penalty.

## 2.

Consider the following code:

LOAD R1, 0(R5)

ADD R1, R1, R1

and that r1 has the initial value 47, r5 has a value of 1000 and that the data value in memory[1000] is 4. Then replacing the code with:

ADD R1, 0(R5)

will give the incorrect result of 51, instead of the correct result 8. In other words, in situations in which Rx and Ry refer to the same register, the replacement will not be semantically equivalent.

# D1: Discussion – in groups of 4, 15 mins.

In the early years of the RISC versus CISC dispute, the total number of different instructions and their variations in the IS was a common indication of the “simplicity” of an ISA (lesser the number, greater the simplicity). Modern RISC instruction sets contain almost as many instructions as old CISC instruction sets. Discuss whether modern “RISC” processors no longer RISC (as envisioned in the 80’s). If they are still RISC, then what features in the instruction set best define the simplicity of an ISA? (e.g., memory access instructions, fixed and simple instruction encoding, register-oriented instructions, simple data types, etc?)

Discussion Points;

RISC Evolution: Modern RISC ISAs (e.g., ARMv9, RISC-V) have added instructions for multimedia, cryptography, etc., but retain core RISC principles:

* + Fixed-length encoding (simplifies pipelining).
  + Load-store architecture (memory access only via LOAD/STORE).
  + Simple addressing modes.

Simplicity Metrics:

* Minimal memory-access instructions.
* Uniform instruction formats.
* Limited operand combinations (e.g., no memory-to-memory operations)

# D2: Discussion – in groups of 4, 10 mins.

Even though the Intel x86 ISA is a clear example of a CISC ISA, modern implementations of it (e.g., Core and Xeon) use many RISC ideas: register-based micro-instructions, pipelining, simple branch micro-instructions, fixed length micro-instructions, etc. Some say that, since at the low level the latest Intel processors *behave* like a RISC, they are RISC. Others say that, since at the software interface (compiler) they are *seen* as CISC, they are CISC. Discuss at what level ISA complexity should be measured. What are the implications of considering the ISA at each level? Are the latest Intel processors RISC?

*Discussion Points;* To aid the discussion, introduce modern approaches for execution of CISC instructions:” cracking” into micro-ops, many physical registers to complement architectural registers. Where;

**Microarchitecture Level**:

* Modern x86 CPUs decode CISC instructions into RISC-like µops (e.g., Intel’s Micro-Ops Fusion).
* Pipelining, out-of-order execution, and speculative branching are RISC techniques.

**Software/Compiler Level**:

* x86 ISA remains CISC (variable-length instructions, memory operands).

**Complexity Measurement**:

* Compiler/Programmer View: CISC (complex ISA).
* Hardware View: RISC (µop execution).